Prediction of the change in GDP Per Capita over time based off of Geographic Region¶

Gross Domestic Product (GDP) is a monetary value that represents the value of all goods and services sold within a particular geographical region. GDP per capita is the GDP of a region divided by its population. We will use a dataset available on Kaggle, that provides the GDP of counties in the United States over the time period from 2001 to 2018. This dataset was sourced from BEA.gov which is the website for the Bureau of Economic Analysis.

We can combine the data from this data set with census data to get the population from each county in order to calculate the GDP per capita of each county.

The goal of this tutorial will be to learn about the trends of GDP per capita in different regions in the United States and ultimately create a model that will be able to predict the change in the GDP per capita of a county given its geographical region in the country.

Part 1: Data Mangling¶

First we will import some python libraries that we will be using during the data analysis process, and then next we will import our data from the CSV file we downloaded into a Pandas DataFrame.

Once we have our data in DataFrame we can begin to take a look a what columns have been provided by the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn import tree
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import plotly.express as px
from urllib.request import urlopen
import json
/opt/conda/lib/python3.9/site-packages/statsmodels/compat/pandas.py:65: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import Int64Index as NumericIndex
In [2]:
gdp_data = pd.read_csv("GDP by County.csv")
gdp_data
Out[2]:
index Year County FIPS Region SUB_REGION State STATE_ABBR County County Full Name GDP (Chained $)
0 0 2001 1001 Southeast East South Central Alabama AL Autauga Autauga County 954588000
1 1 2001 1003 Southeast East South Central Alabama AL Baldwin Baldwin County 3929753000
2 2 2001 1005 Southeast East South Central Alabama AL Barbour Barbour County 813278000
3 3 2001 1007 Southeast East South Central Alabama AL Bibb Bibb County 292693000
4 4 2001 1009 Southeast East South Central Alabama AL Blount Blount County 810381000
... ... ... ... ... ... ... ... ... ... ...
55558 55558 2018 56037 Rocky Mountain Mountain Wyoming WY Sweetwater Sweetwater County 3836603000
55559 55559 2018 56039 Rocky Mountain Mountain Wyoming WY Teton Teton County 2166420000
55560 55560 2018 56041 Rocky Mountain Mountain Wyoming WY Uinta Uinta County 906587000
55561 55561 2018 56043 Rocky Mountain Mountain Wyoming WY Washakie Washakie County 358104000
55562 55562 2018 56045 Rocky Mountain Mountain Wyoming WY Weston Weston County 315885000

55563 rows × 10 columns

Next, we will read in the CSV containing population estimates from the US Census. Notice this data set only contains data for the years between 2010 to 2019 so we will have to combine this dataset with another dataset we will get from the US Census that will give us the data we need for the rest of the years.

In [3]:
population_data = pd.read_csv("co-est2019-alldata.csv", encoding="ISO-8859-1")
population_data
Out[3]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... RDOMESTICMIG2019 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015 RNETMIG2016 RNETMIG2017 RNETMIG2018 RNETMIG2019
0 40 3 6 1 0 Alabama Alabama 4779736 4780125 4785437 ... 1.917501 0.578434 1.186314 1.522549 0.563489 0.626357 0.745172 1.090366 1.773786 2.483744
1 50 3 6 1 1 Alabama Autauga County 54571 54597 54773 ... 4.847310 6.018182 -6.226119 -3.902226 1.970443 -1.712875 4.777171 0.849656 0.540916 4.560062
2 50 3 6 1 3 Alabama Baldwin County 182265 182265 183112 ... 24.017829 16.641870 17.488579 22.751474 20.184334 17.725964 21.279291 22.398256 24.727215 24.380567
3 50 3 6 1 5 Alabama Barbour County 27457 27455 27327 ... -5.690302 0.292676 -6.897817 -8.132185 -5.140431 -15.724575 -18.238016 -24.998528 -8.754922 -5.165664
4 50 3 6 1 7 Alabama Bibb County 22915 22915 22870 ... 1.385134 -4.998356 -3.787545 -5.797999 1.331144 1.329817 -0.708717 -3.234669 -6.857092 1.831952
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3188 50 4 8 56 37 Wyoming Sweetwater County 43806 43806 43574 ... -17.605427 0.502513 15.035735 -4.614071 -13.140225 -13.762031 -18.484164 -23.515727 -18.743773 -17.065527
3189 50 4 8 56 39 Wyoming Teton County 21294 21298 21296 ... -1.583464 -2.575509 1.905293 23.259519 12.508872 4.670450 1.253214 -0.171608 -12.432212 1.797445
3190 50 4 8 56 41 Wyoming Uinta County 21118 21121 21089 ... -9.181105 -18.196975 -4.487730 -10.870861 -15.033634 -10.484550 -11.485101 -18.923455 -13.554993 -9.181105
3191 50 4 8 56 43 Wyoming Washakie County 8533 8528 8530 ... -7.652085 -12.839390 -3.084589 -1.307811 -18.938032 0.000000 -15.204038 -15.950541 -17.246806 -8.289759
3192 50 4 8 56 45 Wyoming Weston County 7208 7208 7198 ... 0.000000 -9.065551 -10.549265 6.191515 0.420345 9.201171 0.693097 -36.227798 -6.910452 2.599090

3193 rows × 164 columns

Now we will read in the data for population estimates between the years 2001 to 2009. We can now merge our two population DataFrames using the pd.merge function. We can use this function to tell pandas to merge the columns in these two DataFrames based on when the CTYNAME and STNAME columns are the same.

Finally, we will clean up this DataFrame by removing columns we will not be using and renaming some columns to better names.

In [4]:
pop_2000_2009 = pd.read_csv("co-est00int-tot.csv", encoding="ISO-8859-1")

population_data = pd.merge(population_data, pop_2000_2009, on=["CTYNAME", "STNAME"])

population_data = population_data[
 [
    "STNAME",
    "CTYNAME",
    "CENSUS2010POP_x",
    "ESTIMATESBASE2010",
    "POPESTIMATE2010_x",
    "POPESTIMATE2011",
    "POPESTIMATE2012",
    "POPESTIMATE2013",
    "POPESTIMATE2014",
    "POPESTIMATE2015",
    "POPESTIMATE2016",
    "POPESTIMATE2017",
    "POPESTIMATE2018",
    "POPESTIMATE2019",
    "ESTIMATESBASE2000",
    "POPESTIMATE2000",
    "POPESTIMATE2001",
    "POPESTIMATE2002",
    "POPESTIMATE2003",
    "POPESTIMATE2004",
    "POPESTIMATE2005",
    "POPESTIMATE2006",
    "POPESTIMATE2007",
    "POPESTIMATE2008",
    "POPESTIMATE2009",
    "CENSUS2010POP_y",
    "POPESTIMATE2010_y",
 ]
]

population_data = population_data.rename(columns={"POPESTIMATE2010_x": "POPESTIMATE2010"})
population_data
Out[4]:
STNAME CTYNAME CENSUS2010POP_x ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 ... POPESTIMATE2002 POPESTIMATE2003 POPESTIMATE2004 POPESTIMATE2005 POPESTIMATE2006 POPESTIMATE2007 POPESTIMATE2008 POPESTIMATE2009 CENSUS2010POP_y POPESTIMATE2010_y
0 Alabama Alabama 4779736 4780125 4785437 4799069 4815588 4830081 4841799 4852347 ... 4480089 4503491 4530729 4569805 4628981 4672840 4718206 4757938 4779736 4785298
1 Alabama Autauga County 54571 54597 54773 55227 54954 54727 54893 54864 ... 45909 46800 48366 49676 51328 52405 53277 54135 54571 54632
2 Alabama Baldwin County 182265 182265 183112 186558 190145 194885 199183 202939 ... 147957 151509 156266 162183 168121 172404 175827 179406 182265 183195
3 Alabama Barbour County 27457 27455 27327 27341 27169 26937 26755 26283 ... 28653 28594 28287 28027 27861 27757 27808 27657 27457 27411
4 Alabama Bibb County 22915 22915 22870 22745 22667 22521 22553 22566 ... 21199 21399 21721 22042 22099 22438 22705 22941 22915 22867
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3186 Wyoming Sweetwater County 43806 43806 43574 43986 45002 45157 44948 44719 ... 37428 37450 38026 38739 39749 41470 42358 44133 43806 43621
3187 Wyoming Teton County 21294 21298 21296 21414 21624 22315 22773 23047 ... 18837 19066 19467 19632 20014 20472 20988 21232 21294 21302
3188 Wyoming Uinta County 21118 21121 21089 20896 20996 20951 20822 20763 ... 19587 19480 19470 19494 19709 20171 20613 21054 21118 21137
3189 Wyoming Washakie County 8533 8528 8530 8449 8409 8413 8273 8278 ... 7988 7976 7960 8022 7979 8169 8229 8423 8533 8556
3190 Wyoming Weston County 7208 7208 7198 7142 7077 7136 7138 7208 ... 6578 6610 6646 6594 6717 7033 7133 7266 7208 7182

3191 rows × 27 columns

Now that we have our population data and our GDP data we need to combine them to calculate the GDP per capita. We can do this by applying a function to all the rows along the GPD DataFrame that will calculate the GDP Per Capita.

This function can use the county name, state name, and year from the row to look up the correct population from the population DataFrame and calculate the GDP per capita using this value.

In [5]:
def get_year_column(year):
    return f"POPESTIMATE{year}"

def get_gdp_per_capita(r):
    county_name = r["County Full Name"]
    populations = population_data[(population_data["STNAME"] == r["State"]) & (population_data["CTYNAME"] == county_name)][get_year_column(r["Year"])]
    population = populations.iloc[0] if populations.size > 0 else np.nan
    return r["GDP (Chained $)"]/population if population is not np.nan else np.nan

def population_for_row(r):
    county_name = r["County Full Name"]
    populations = population_data[(population_data["STNAME"] == r["State"]) & (population_data["CTYNAME"] == county_name)][get_year_column(r["Year"])]
    population = populations.iloc[0] if populations.size > 0 else np.nan
    return population if population is not np.nan else np.nan
    
gdp_data["gdp_per_capita"] = gdp_data.apply(get_gdp_per_capita, axis=1)
gdp_data["population"] = gdp_data.apply(population_for_row, axis=1)

gdp_data
Out[5]:
index Year County FIPS Region SUB_REGION State STATE_ABBR County County Full Name GDP (Chained $) gdp_per_capita population
0 0 2001 1001 Southeast East South Central Alabama AL Autauga Autauga County 954588000 21265.521620 44889.0
1 1 2001 1003 Southeast East South Central Alabama AL Baldwin Baldwin County 3929753000 27125.128559 144875.0
2 2 2001 1005 Southeast East South Central Alabama AL Barbour Barbour County 813278000 28177.181859 28863.0
3 3 2001 1007 Southeast East South Central Alabama AL Bibb Bibb County 292693000 13919.202967 21028.0
4 4 2001 1009 Southeast East South Central Alabama AL Blount Blount County 810381000 15630.841933 51845.0
... ... ... ... ... ... ... ... ... ... ... ... ...
55558 55558 2018 56037 Rocky Mountain Mountain Wyoming WY Sweetwater Sweetwater County 3836603000 89518.946288 42858.0
55559 55559 2018 56039 Rocky Mountain Mountain Wyoming WY Teton Teton County 2166420000 93103.270446 23269.0
55560 55560 2018 56041 Rocky Mountain Mountain Wyoming WY Uinta Uinta County 906587000 44677.064853 20292.0
55561 55561 2018 56043 Rocky Mountain Mountain Wyoming WY Washakie Washakie County 358104000 45461.977910 7877.0
55562 55562 2018 56045 Rocky Mountain Mountain Wyoming WY Weston Weston County 315885000 45621.750433 6924.0

55563 rows × 12 columns

Before we move on to exploring our data, it is important to make sure we don't have any unexpected missing values in our data. We can see that there are four counties that we were unable to find the population in order to calculate their GDP per capita. We can manually check these counties in the census data and verify that we do not have a population estimate available.

Since we don't have the population of these counties, we can remove them from the dataset.

In [6]:
gdp_data[gdp_data["gdp_per_capita"].isnull()]["County"].unique()
Out[6]:
array(['Kusilvak', 'Oglala Lakota', 'LaSalle', 'Petersburg'], dtype=object)
In [7]:
gdp_data = gdp_data[~(gdp_data["gdp_per_capita"].isnull())]

Data Exploration¶

In this section, we will take a deeper look into what our GDP data looks like.

GDP Per Capita vs. Time plot.¶

We can begin by simply making a scatter plot of the GDP per capita of all counties vs. time.

In [8]:
ax = gdp_data.plot.scatter(x="Year", y="gdp_per_capita")
ax.set_title("GDP Per Capita vs. Time")
ax.set_ylabel("GDP Per Capita")
Out[8]:
Text(0, 0.5, 'GDP Per Capita')

Looking at this plot we can clearly see that there is one county that has a GDP per capita value that is much higher than the rest of the counties in the country. In fact, if we look at the axis of this plot, the GDP per capita of this is greater than $60,000,000. This is an extremely high value. It's so high in fact that this is likely an error in our dataset as opposed to simply an outlier. We can verify this by looking at what county has the max GDP per capita.

In [9]:
gdp_data[gdp_data['gdp_per_capita'] == gdp_data['gdp_per_capita'].max()]
Out[9]:
index Year County FIPS Region SUB_REGION State STATE_ABBR County County Full Name GDP (Chained $) gdp_per_capita population
54161 54161 2018 48301 Southwest West South Central Texas TX Loving Loving County 9557235000 6.457591e+07 148.0

If we do some research into Loving County, TX we can see that it is a very small county with a population of only 57 as of 2021! It is extremely unlikely that this county actually has a GDP per capita this high. So we can go ahead and remove this county from our dataset.

While we are at it, we can use a simple heuristic to remove any other outliers or errors from the data. We can remove a county if

$$gdpPerCapita > 200000 \land population < 10000$$

It is very unlikely that a county this small would have a GDP this high. Additionally, we can remove any county where the GDP per capita is greater than \$400000. Some quick research shows us that New York County has the highest GDP per capita and it is less than \\$400000.

In [10]:
gdp_data = gdp_data[gdp_data["County Full Name"] != "Loving County"]

gdp_data = gdp_data[(~((gdp_data["gdp_per_capita"] > 200000) & (gdp_data["population"] < 10000))) & (gdp_data["gdp_per_capita"] < 400000)]

ax = gdp_data.plot.scatter(x="Year", y="gdp_per_capita")
ax.set_title("GDP Per Captita vs. Time")
ax.set_ylabel("GDP Per Capita")
Out[10]:
Text(0, 0.5, 'GDP Per Capita')

Now the scatter plot looks much better and there are no obvious errors or outliers in the data.

It seems like there may be a positive trend in GDP per capita, but we will need to do some more analysis to confirm this suspicion.

Graphing a few specific counties.¶

In [11]:
counties = [("Los Angeles County", "CA"), ("Montgomery County", "MD"), ("Prince George's County", "MD"), ("New York County", "NY"), ("San Francisco County", "CA")]

for c, s in counties:
    ax = gdp_data[(gdp_data["County Full Name"] == c) & (gdp_data["STATE_ABBR"] == s)].sort_values(by="Year").plot(x="Year", y="gdp_per_capita")
    
    ax.set_title(f"{c}, {s} GDP per capita over time")
    plt.show()

For these specific counties, it appears that the GDP is increasing over time. The trend seems to be fairly linear, however, some counties Prince George's and Montgomery counties, seem to have a peak somewhere around 2005-2007 and then the growth slows down after that.

Another thing we can learn from these plots is that all of these plots are offset from each other. In other words, even though the GDP increases for all counties, some counties start off with a much higher GDP per capita value than other counties. Our model will need to account for this. We will come back and visualize this in a bit.

We can also look at a chart of the mean GDP per capita over time, to verify that this trend holds in general.

In [12]:
ax = gdp_data.groupby("Year")["gdp_per_capita"].mean().plot()
ax.set_title("Average GDP per capita over time")
ax.set_ylabel("Average GDP per capita")
Out[12]:
Text(0, 0.5, 'Average GDP per capita')

Graphs of GDP per capita in specific regions.¶

Next, we will take a look at plots of the GDP of specific geographic regions over time. If we are going to make a model that predicts changes in GDP per capita over time, based on geographical regions, we should look at what this data looks like first.

In [13]:
gdp_data["Region"].unique()
Out[13]:
array(['Southeast', 'Far West', 'Southwest', 'Rocky Mountain',
       'Great Lakes', 'Plains', 'New England', 'Mideast'], dtype=object)
In [14]:
regions = gdp_data["Region"].unique()
for region in regions:
    ax = gdp_data[gdp_data["Region"] == region].groupby("Year")["gdp_per_capita"].mean().plot()
    ax.set_title(f"{region} mean GDP per capita over time")
    ax.set_ylabel("Average GDP per capita")
    plt.show()

These plots tell us a couple of different things. First of we can see that the increase in GDP per capita seems to be more linear in some regions, such as the Mideast, than in other regions such as the Southeast. With this information, we can hypothesize that a linear regression model may perform worse than other regression models we may try.

We can also see that the average rate of increase varies by region as well as the average GDP per capita in each region at any given time.

Visualizing a GDP per capita baseline.¶

We can use the plotly python library to create a map that shows each county with a color on a gradient representing how high the GDP per capita is in that county compared to the rest of the counties in the united states.

In [15]:
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
    
# Add extra 0's to FIPS number if it's string representation is less than 5 characters.
gdp_data["County FIPS"] = gdp_data["County FIPS"].map(lambda x: str(x) if len(str(x)) >= 5 else "0"*(5-len(str(x))) + str(x))

fig = px.choropleth_mapbox(gdp_data[gdp_data["Year"] == 2018], geojson=counties, locations='County FIPS', color='gdp_per_capita',
                           mapbox_style="carto-positron",
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           opacity=0.5,
                           labels={'gdp_per_capita':'GDP Per Capita'})

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Looking at this map it is clear to see that some counties have GDPs per capita that are much higher than the surrounding counties. If we explore the map a bit more, we can see that cities in particular have higher GDPs per capita than surrounding counties. Take a look at Manhatten for example. Manhatten has a GDP per capita of around \$368,000 whereas surrounding counties have GDP per capitas in the range of \\$25,000-\$75,000

Adding a Baseline GDP feature.¶

Based off of this information, it makes sense to add a new feature to our dataset that will act as a baseline GDP per capita for each county. We can just use the GDP per capita from 2001 for this value.

In [16]:
def get_base_gdp(r):
    county = gdp_data[gdp_data["County Full Name"] == r["County Full Name"]]
    base_gdp = county[(county["State"] == r["State"])  & (county["Year"] == 2001)]["gdp_per_capita"]
    
    base_gdp = base_gdp.iloc[0] if base_gdp.size > 0 else np.nan
    
    return base_gdp

gdp_data["base_gdp"] = gdp_data.apply(get_base_gdp, axis=1)
In [17]:
gdp_data
Out[17]:
index Year County FIPS Region SUB_REGION State STATE_ABBR County County Full Name GDP (Chained $) gdp_per_capita population base_gdp
0 0 2001 01001 Southeast East South Central Alabama AL Autauga Autauga County 954588000 21265.521620 44889.0 21265.521620
1 1 2001 01003 Southeast East South Central Alabama AL Baldwin Baldwin County 3929753000 27125.128559 144875.0 27125.128559
2 2 2001 01005 Southeast East South Central Alabama AL Barbour Barbour County 813278000 28177.181859 28863.0 28177.181859
3 3 2001 01007 Southeast East South Central Alabama AL Bibb Bibb County 292693000 13919.202967 21028.0 13919.202967
4 4 2001 01009 Southeast East South Central Alabama AL Blount Blount County 810381000 15630.841933 51845.0 15630.841933
... ... ... ... ... ... ... ... ... ... ... ... ... ...
55558 55558 2018 56037 Rocky Mountain Mountain Wyoming WY Sweetwater Sweetwater County 3836603000 89518.946288 42858.0 104387.354671
55559 55559 2018 56039 Rocky Mountain Mountain Wyoming WY Teton Teton County 2166420000 93103.270446 23269.0 83380.903876
55560 55560 2018 56041 Rocky Mountain Mountain Wyoming WY Uinta Uinta County 906587000 44677.064853 20292.0 50759.748622
55561 55561 2018 56043 Rocky Mountain Mountain Wyoming WY Washakie Washakie County 358104000 45461.977910 7877.0 42887.704512
55562 55562 2018 56045 Rocky Mountain Mountain Wyoming WY Weston Weston County 315885000 45621.750433 6924.0 42663.789117

55007 rows × 13 columns

Again we can go through and remove any counties in which we were not able to get a base GDP per capita.

In [18]:
gdp_data = gdp_data[~(gdp_data['base_gdp'].isnull())]

Model Creation¶

Now that we have looked at and explored the data, we can make a model that will predict a county's GDP per capita based off of three features: the year, the county's region, and the county's baseline GDP per capita.

Decision Tree Regressor¶

We saw during the data exploration that some regions of the united states have GDP trends that are not exactly linear. We will first use a decision tree regressor and then compare the model we get to a linear regression model. If we are correct that there are nonlinear trends in our data, the decision tree regressor should perform better.

In order to use the decision tree classifier we need to transform our input data a bit. The decision tree classifier expects numerical inputs, but one of the features (Region) is categorical. We will use a ColumnTransformer to transform just the Region column using a OneHotEncoder from the scikit learn library.

This will turn our categorical data into a list of numerical values that the decision tree model will be able to work with.

In [19]:
enc = OneHotEncoder()
transform = ColumnTransformer([
    ('one-hot', enc, ["Region"]),
    ('pass', 'passthrough', ['Year', 'base_gdp']),
])

X_train, X_test, y_train, y_test = model_selection.train_test_split(transform.fit_transform(gdp_data[["Year", "Region", "base_gdp"]]), gdp_data["gdp_per_capita"], test_size=.3)

clf = tree.DecisionTreeRegressor()
clf.fit(X_train, y_train)

clf.score(X_test, y_test)
Out[19]:
0.9119436191700345

After splitting the data into test and training sets, training the mode, and then using the test data to evaluate the model. We can see that we get an $R^2$ score of around 0.912. This value is very close to 1 which indicates that the decision tree model is a good fit for our data.

Linear Regression Model¶

Now we can create a linear regression model that uses the year and the region features and has another term with the baseline GDP added on.

In [20]:
smf.ols(formula="gdp_per_capita ~ Year*C(Region) + base_gdp", data=gdp_data).fit().summary()
Out[20]:
OLS Regression Results
Dep. Variable: gdp_per_capita R-squared: 0.702
Model: OLS Adj. R-squared: 0.702
Method: Least Squares F-statistic: 8076.
Date: Fri, 16 Dec 2022 Prob (F-statistic): 0.00
Time: 14:32:59 Log-Likelihood: -5.9822e+05
No. Observations: 54946 AIC: 1.196e+06
Df Residuals: 54929 BIC: 1.197e+06
Df Model: 16
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.083e+06 9.01e+04 -12.021 0.000 -1.26e+06 -9.06e+05
C(Region)[T.Great Lakes] 2.797e+05 1.06e+05 2.630 0.009 7.13e+04 4.88e+05
C(Region)[T.Mideast] -4.51e+04 1.26e+05 -0.357 0.721 -2.93e+05 2.03e+05
C(Region)[T.New England] 4.892e+05 1.7e+05 2.875 0.004 1.56e+05 8.23e+05
C(Region)[T.Plains] -1.039e+06 1.02e+05 -10.202 0.000 -1.24e+06 -8.4e+05
C(Region)[T.Rocky Mountain] 4.924e+04 1.21e+05 0.407 0.684 -1.88e+05 2.86e+05
C(Region)[T.Southeast] 8.354e+05 9.74e+04 8.573 0.000 6.44e+05 1.03e+06
C(Region)[T.Southwest] -4.594e+05 1.1e+05 -4.189 0.000 -6.74e+05 -2.44e+05
Year 542.4089 44.821 12.102 0.000 454.559 630.259
Year:C(Region)[T.Great Lakes] -140.4507 52.921 -2.654 0.008 -244.177 -36.725
Year:C(Region)[T.Mideast] 22.0359 62.870 0.351 0.726 -101.189 145.261
Year:C(Region)[T.New England] -244.5586 84.692 -2.888 0.004 -410.555 -78.562
Year:C(Region)[T.Plains] 519.0740 50.704 10.237 0.000 419.693 618.455
Year:C(Region)[T.Rocky Mountain] -24.7407 60.230 -0.411 0.681 -142.792 93.310
Year:C(Region)[T.Southeast] -417.8386 48.490 -8.617 0.000 -512.879 -322.798
Year:C(Region)[T.Southwest] 228.7194 54.565 4.192 0.000 121.771 335.668
base_gdp 0.9679 0.003 336.271 0.000 0.962 0.974
Omnibus: 65098.985 Durbin-Watson: 1.899
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27413462.913
Skew: 5.827 Prob(JB): 0.00
Kurtosis: 111.803 Cond. No. 1.92e+08


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.92e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

After fitting our linear regression model we get an $R^2$ value of 0.702. While this value is not bad, it is significantly lower than the $R^2$ value we got from the decision tree regressor. This tells us that we were correct in our hypothesis that the trends in GDP per capita are not completely linear and that a non linear model will perform better than a linear regressor.

Conclusion¶

Going through the data analysis process in this tutorial has given us a lot of insights into the trends that we can see in GDP across the country. In the process, we also learned a lot about the data science process in general. We learned about reading, cleaning, and combining data from multiple sources. We also walked through how to analyze/explore data and how we can gain key information from this process that will later help us create an accurate model for our data. Finally, we actually went through the model creation process twice and compared two different models, and then considered how the different models performed against our expectations.

All the data and python notebooks can be found here.